A bank needs a way to decide if and when a customer can be granted a loan.
A doctor may need to decide whether an oncological patient should undergo surgery or a less aggressive treatment.
A company may need to decide whether to invest in new technologies or stay with the traditional ones.
In all those cases a decision tree may provide a structured approach to decision-making that is based on data and can be easily explained and justified.
A decision tree is a graphical representation of a series of decisions and their potential outcomes.
It is obtained by recursively stratifying or segmenting the feature space into a number of simple regions.
Each region (decision) corresponds to a node in the tree, and each potential outcome corresponds to a branch.
The tree structure can be used to guide decision-making based on data.
We need context, and we also need to learn how to build, evaluate and optimize trees.
Context
When is it appropriate to rely on decision trees?
When would other approaches be preferable?
What type of decision trees can be used?
Build the tree
Two main types exist, jointly known as Classification and Regression Trees (CART):
Classification Trees are built when the response variable is categorical.
Regression Trees are used when the response variable is numerical: they aim to predict the value of a continuous response variable from the values of the predictor variables.
| Aspect | Regression Trees | Classification Trees |
|---|---|---|
| Outcome var. type | Continuous | Categorical |
| Goal | To predict a numerical value | To predict a class label |
| Splitting criteria | Mean Squared Error, Mean Abs. Error, etc. | Gini Impurity, Entropy, etc. |
| Leaf node prediction | Mean or median of the target variable in that region | Mode or majority class of the target variable ... |
| Tree visualization | Sequence of splits leading to a numerical predic... | Sequence of splits leading to a categorical pr... |
| Examples of use cases | Predicting housing prices, predicting stock prices | Predicting customer churn, predicting the like... |
| Evaluation metric | Mean Squared Error, Mean Absolute Error, R-square... | Accuracy, Precision, Recall, F1-score, etc. |
| Overfitting | Can suffer from overfitting if the tree is too de... | Can suffer from overfitting if the tree is too ... |
| Pruning | Can be pruned to reduce overfitting | Can be pruned to reduce overfitting |
| Ensemble methods | Random Forest, Gradient Boosting, etc. | Random Forest, Gradient Boosting, etc. |
Building the tree requires deciding which splits to make, when to stop splitting, and how to assign classes (or values) to the terminal nodes.
Evaluation is similar to that of other classifiers.
Optimization involves deciding how to prune the tree, balancing accuracy against complexity.
| Package | Algorithm | Dataset size | Missing data handling | Ensemble methods | Visual repr. | User interface |
|---|---|---|---|---|---|---|
| rpart | RPART | Medium to large | Poor | No | Yes | Simple |
| caret | Various | Various | Depends on algorithm | Yes | Depends on algorithm | Complex |
| tree | CART | Small to medium | Poor | No | Yes | Simple |
A glimpse of the `PimaIndiansDiabetes2` dataset (from the `mlbench` package), used as a running example:
Rows: 768
Columns: 9
$ pregnant <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1…
$ glucose <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139,…
$ pressure <dbl> 72, 66, 64, 66, 40, 74, 50, NA, 70, 96, 92, 74, 80, 60, 72, N…
$ triceps <dbl> 35, 29, NA, 23, 35, NA, 32, NA, 45, NA, NA, NA, NA, 23, 19, N…
$ insulin <dbl> NA, NA, NA, 94, 168, NA, 88, NA, 543, NA, NA, NA, NA, 846, 17…
$ mass <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, NA, 37.…
$ pedigree <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158…
$ age <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 3…
$ diabetes <fct> pos, neg, pos, neg, pos, neg, pos, neg, pos, pos, neg, pos, n…
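As an illustration, a classification tree can be fitted to these data in a couple of lines. This is only a sketch, assuming the data are `PimaIndiansDiabetes2` from the `mlbench` package and that `rpart` is available (`rpart` handles the missing values via surrogate splits):

```r
# Sketch: fit a classification tree to the Pima data with rpart
# (assumes the mlbench and rpart packages are installed)
library(mlbench)
library(rpart)

data("PimaIndiansDiabetes2")
pima_tree <- rpart(diabetes ~ ., data = PimaIndiansDiabetes2,
                   method = "class")  # categorical response -> classification tree
printcp(pima_tree)                    # splits used and cross-validated error
```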
A node is denoted by \(t\). We will also denote the left child node by \(t_{L}\) and the right one by \(t_{R}\).
Denote the collection of all the nodes in the tree by \(T\) and the collection of all the leaf nodes by \(\tilde{T}\).
A split will be denoted by \(s\). The set of splits is denoted by \(S\).
The tree represents the recursive splitting of the space.
In the end, every leaf node is assigned with a class and a test point is assigned with the class of the leaf node it lands in.
It involves the following three elements:
1. The selection of the splits: how do we decide which node (region) to split and how to split it?
2. If we know how to make splits ('grow' the tree), how do we decide when to declare a node terminal and stop splitting?
3. How do we assign each terminal node to a class?
To build a tree, questions have to be generated that induce splits based on the value of a single variable.
Ordered variable \(X_j\): questions of the form "Is \(X_j \leq c\)?", for candidate thresholds \(c\) (typically the midpoints between consecutive observed values).
Categorical variables, \(X_j \in \{1, 2, \ldots, M\}\): questions of the form "Does \(X_j \in A\)?", for subsets \(A \subset \{1, 2, \ldots, M\}\).
The pool of candidate splits for all \(p\) variables is formed by combining all the generated questions.
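A minimal base-R sketch of how such candidate questions can be enumerated (the variable names and values here are illustrative):

```r
# Ordered variable: questions "Is x <= c?" with c the midpoints
# between consecutive distinct observed values
x_ord <- c(2.1, 3.5, 5.0, 7.0)
v <- sort(unique(x_ord))
cuts <- (head(v, -1) + tail(v, -1)) / 2   # candidate thresholds: 2.8, 4.25, 6.0

# Categorical variable: questions "Is x in A?" for proper subsets A
x_cat <- c("a", "b", "c")
subsets <- unlist(lapply(seq_len(length(x_cat) - 1),
                         function(k) combn(x_cat, k, simplify = FALSE)),
                  recursive = FALSE)
length(subsets)                           # 6 candidate subsets
```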
The way we choose the split is to measure every candidate split by a 'goodness of split' criterion \(\Phi(s, t)\), which depends on the split \(s\) and on the node \(t\) being split.
The 'goodness of split' is in turn measured by an impurity function.
Intuitively, when we split the points we want the region corresponding to each leaf node to be "pure": most points in the region come from the same class, i.e., one class dominates.
(Figure: two example splits, one where purity is not increased and one where it is.)
Definition: An impurity function is a function \(\Phi\) defined on the set of all \(K\)-tuples of numbers \(\left(p_{1}, \cdots, p_{K}\right)\) satisfying \(p_{j} \geq 0, \quad j=1, \cdots, K, \Sigma_{j} p_{j}=1\) with the properties:
1. \(\Phi\) achieves its maximum only at the uniform distribution \((1/K, \ldots, 1/K)\).
2. \(\Phi\) achieves its minimum, 0, only when some \(p_{j}=1\) (the node is pure).
3. \(\Phi\) is a symmetric function of \(p_{1}, \cdots, p_{K}\).
Commonly used impurity functions for classification trees:
Entropy: \(-\sum_{j=1}^{K} p_{j} \log p_{j}\). If \(p_{j}=0\), use the convention \(\lim_{p \rightarrow 0} p \log p=0\).
Misclassification rate: \(1-\max _{j} p_{j}\).
Gini index: \(\sum_{j=1}^{K} p_{j}\left(1-p_{j}\right)=1-\sum_{j=1}^{K} p_{j}^{2}\).
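The three measures can be written directly as functions of a vector of class proportions (a small base-R sketch):

```r
# Impurity measures for a vector p of class proportions (non-negative, sums to 1)
entropy  <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }  # convention: 0*log(0) = 0
misclass <- function(p) 1 - max(p)
gini     <- function(p) 1 - sum(p^2)

gini(c(0.5, 0.5))      # 0.5  (maximally impure two-class node)
misclass(c(0.5, 0.5))  # 0.5
gini(c(1, 0))          # 0    (pure node: all three measures vanish)
entropy(c(1, 0))       # 0
```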
Given an impurity function \(\phi\), the impurity of a node \(t\) is defined as
\[ i(t)=\phi(p(1 \mid t), p(2 \mid t), \ldots, p(K \mid t)) \]
where \(p(j \mid t)\) is the estimated posterior probability of class \(j\) given a point is in node \(t\).
The goodness of a split \(s\) at node \(t\) is then the decrease in impurity it produces:
\[ \Phi(s, t)=\Delta i(s, t)=i(t)-p_{R}\, i\left(t_{R}\right)-p_{L}\, i\left(t_{L}\right) \]
where \(p_{L}\) and \(p_{R}\) are the proportions of cases in \(t\) sent to \(t_{L}\) and \(t_{R}\), respectively.
\[ H(t)=-\sum_{j=1}^{K} p(j \mid t) \log _{2} p(j \mid t) \]
From here, an information gain (that is, impurity decrease) measure can be introduced.
This information-theoretic approach compares the entropy of a node before and after a split:
\[ \begin{aligned} IG(t, s) & =\text{(original entropy)}-\text{(entropy after the split)} \\ IG(t, s) & =H(t)-\sum_{i=1}^{n} \frac{\left|t_{i}\right|}{|t|} H\left(t_{i}\right) \end{aligned} \]
where \(t_{1}, \ldots, t_{n}\) are the child nodes created by the split \(s\) and \(|t|\) denotes the number of cases in node \(t\).
Consider the problem of designing an algorithm to automatically differentiate between apples and pears (class labels) given only their width and height measurements (features).
| Width | Height | Fruit |
|---|---|---|
| 7.1 | 7.3 | Apple |
| 7.9 | 7.5 | Apple |
| 7.4 | 7.0 | Apple |
| 8.2 | 7.3 | Apple |
| 7.6 | 6.9 | Apple |
| 7.8 | 8.0 | Apple |
| 7.0 | 7.5 | Pear |
| 7.1 | 7.9 | Pear |
| 6.8 | 8.0 | Pear |
| 6.6 | 7.7 | Pear |
| 7.3 | 8.2 | Pear |
| 7.2 | 7.9 | Pear |
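For instance, the gain of the (hypothetical) candidate question "Is Width ≤ 7.2?" on the table above can be computed directly:

```r
# Information gain of the candidate split "Width <= 7.2" on the fruit data
fruit <- data.frame(
  width = c(7.1, 7.9, 7.4, 8.2, 7.6, 7.8,   # apples
            7.0, 7.1, 6.8, 6.6, 7.3, 7.2),  # pears
  class = rep(c("Apple", "Pear"), each = 6)
)
H <- function(y) {                      # entropy in bits
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}
left  <- fruit$class[fruit$width <= 7.2]  # 1 apple, 5 pears
right <- fruit$class[fruit$width >  7.2]  # 5 apples, 1 pear
IG <- H(fruit$class) -
  length(left)  / nrow(fruit) * H(left) -
  length(right) / nrow(fruit) * H(right)
round(IG, 2)                              # 0.35 bits
```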
Maximizing information gain is one possible criterion for choosing among splits.
To avoid excessive complexity, it is usual to stop splitting when the information gain does not compensate for the increase in complexity.
In practice, stop splitting if \[ \max _{s \in S} \Delta I(s, t)<\beta \] for some pre-set threshold \(\beta>0\).
The decision tree classifies new data points as follows: each leaf node \(t\) is assigned the majority class
\[ \kappa(t)=\arg \max _{j} p(j \mid t) \]
and a new point receives the class of the leaf node it lands in.
Let’s assume we have built a tree and have the classes assigned for the leaf nodes.
Goal: estimate the classification error rate for this tree.
We need to introduce the resubstitution estimate \(r(t)\) for the probability of misclassification, given that a case falls into node \(t\). This is:
\[ r(t)=1-\max _{j} p(j \mid t)=1-p(\kappa(t) \mid t) \]
Denote \(R(t)=r(t) p(t)\), that is, the misclassification error rate weighted by the probability of the node.
The resubstitution estimation for the overall misclassification rate \(R(T)\) of the tree classifier \(T\) is:
\[ R(T)=\sum_{t \in \tilde{T}} R(t) \]
Trees obtained by looking for optimal splits tend to overfit: they fit the training data well but generalize badly and tend to fail more in predictions.
To reduce complexity and overfitting while keeping the tree as accurate as possible, tree pruning may be applied.
Essentially, pruning works by removing branches that are unlikely to improve the accuracy of the model on new data.
\[ R_\alpha(T) =R(T)+\alpha|\tilde{T}| \] where \(|\tilde{T}|\) is the number of leaf nodes and \(\alpha\) is the parameter that controls the trade-off between tree complexity and accuracy.
1. Start by building a large tree that overfits the data.
2. Use cross-validation to estimate the optimal value of \(\alpha\) that minimizes the generalization error.
3. Prune the tree by removing the branches whose improvement in impurity is smaller than the penalty term multiplied by \(\alpha\).
4. Iterate the process until no more branches can be pruned, or until a minimum tree size is reached.
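A sketch of this procedure using `rpart`, whose complexity parameter `cp` plays the role of \(\alpha\) (this assumes the `mlbench` and `rpart` packages; the variable names are illustrative):

```r
# Cost-complexity pruning sketch on the Pima data
# (assumes the mlbench and rpart packages are installed)
library(mlbench)
library(rpart)

data("PimaIndiansDiabetes2")
set.seed(123)
big_tree <- rpart(diabetes ~ ., data = PimaIndiansDiabetes2,
                  method = "class", cp = 0)            # grow a large, overfitted tree
cp_tab  <- big_tree$cptable                            # cp (alpha) vs. CV error
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"] # cp minimizing CV error
pruned  <- prune(big_tree, cp = best_cp)               # prune back to that subtree
```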
When the response variable is numeric, decision trees are regression trees.
They are an option of choice in many settings, for distinct reasons.
The airquality dataset from the datasets package contains daily air quality measurements in New York from May through September of 1973 (153 days).
The goal is to find regions (boxes) \(R_1, \ldots, R_J\) that minimize
\[ \sum_{j=1}^{J} \sum_{i \in R_j}\left(y_i-\hat{y}_{R_j}\right)^2, \]
where \(\hat{y}_{R_j}\) is the mean response for the training observations within the \(j\)th box.
Taking a top-down, greedy approach, at each step we consider a splitting variable \(j\) and cutpoint \(s\), defining the half-planes \(R_1(j, s)=\left\{X \mid X_j<s\right\}\) and \(R_2(j, s)=\left\{X \mid X_j \geq s\right\}\), and seek the values of \(j\) and \(s\) that minimize the equation:
\[ \sum_{i: x_i \in R_1(j, s)}\left(y_i-\hat{y}_{R_1}\right)^2+\sum_{i: x_i \in R_2(j, s)}\left(y_i-\hat{y}_{R_2}\right)^2. \]
Once the regions have been created, we predict the response for an observation using the mean of the training observations in the region to which that observation belongs.
In the example, for an observation belonging to the shaded region, the prediction would be:
\[ \hat{y} =\frac{1}{4}\left(y_2+y_3+y_5+y_9\right) \]
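The greedy search for the cutpoint \(s\), for a single predictor, can be sketched in a few lines of base R (function and variable names are illustrative):

```r
# Greedy search for the split point s minimizing the two-region RSS
best_split <- function(x, y) {
  v <- sort(unique(x))
  cand <- (head(v, -1) + tail(v, -1)) / 2   # candidate cutpoints s (midpoints)
  rss <- sapply(cand, function(s) {
    yl <- y[x < s]; yr <- y[x >= s]         # R1(j, s) and R2(j, s)
    sum((yl - mean(yl))^2) + sum((yr - mean(yr))^2)
  })
  c(s = cand[which.min(rss)], rss = min(rss))
}

# Toy example: a clear jump in y between x = 5 and x = 6 should be found
x <- 1:10
y <- c(1, 1, 2, 1, 2, 9, 8, 9, 10, 9)
best_split(x, y)["s"]                       # 5.5
```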
set.seed(123)
train <- sample(1:nrow(aq), size = nrow(aq)*0.7)
aq_train <- aq[train,]
aq_test <- aq[-train,]
aq_regresion <- tree::tree(formula = Ozone ~ .,
data = aq_train, split = "deviance")
summary(aq_regresion)
Regression tree:
tree::tree(formula = Ozone ~ ., data = aq_train, split = "deviance")
Variables actually used in tree construction:
[1] "Temp" "Wind" "Solar.R" "Day"
Number of terminal nodes: 8
Residual mean deviance: 285.6 = 21420 / 75
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-58.2000 -7.9710 -0.4545 0.0000 5.5290 81.8000
1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of \(\alpha\).
3. Use K-fold cross-validation to choose \(\alpha\). That is, divide the training observations into \(K\) folds. For each \(k=1, \ldots, K\), repeat the growing and pruning steps on all but the \(k\)th fold and evaluate the prediction error on the held-out \(k\)th fold, as a function of \(\alpha\). Average the results for each value of \(\alpha\), and pick \(\alpha\) to minimize the average error.
4. Return the subtree from Step 2 that corresponds to the chosen value of \(\alpha\).
cv_aq <- tree::cv.tree(aq_regresion, K = 5)
optimal_size <- rev(cv_aq$size)[which.min(rev(cv_aq$dev))]
aq_final_tree <- tree::prune.tree(
tree = aq_regresion,
best = optimal_size
)
summary(aq_final_tree)
Regression tree:
tree::tree(formula = Ozone ~ ., data = aq_train, split = "deviance")
Variables actually used in tree construction:
[1] "Temp" "Wind" "Solar.R" "Day"
Number of terminal nodes: 8
Residual mean deviance: 285.6 = 21420 / 75
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-58.2000 -7.9710 -0.4545 0.0000 5.5290 81.8000
In this example pruning does not improve the tree.
Trees are very easy to explain to people.
Decision trees may be seen as good mirrors of human decision-making.
Trees can be displayed graphically, and are easily interpreted even by a non-expert.
Trees can easily handle qualitative predictors without the need to create dummy variables.
Trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches.
Additionally, trees can be very non-robust: a small change in the data can cause a large change in the final estimated tree.
Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). Springer.
Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and Regression Trees. CRC Press.
Greenwell, B. M. (2022). Tree-Based Methods for Statistical Learning in R (1st ed.). Chapman and Hall/CRC. DOI: https://doi.org/10.1201/9781003089032
Genuer, R., & Poggi, J. M. (2020). Random Forests with R. Springer (Use R!).
Applied Data Mining and Statistical Learning (Penn State University).
An Introduction to Recursive Partitioning Using the RPART Routines.